Abstract

We present the second iteration of a machine learning approach to static code analysis and
fingerprinting for weaknesses related to security, software engineering, and other concerns.
The approach uses the open-source MARF framework and the MARFCAT application built on it,
applied to the data sets of NIST's SATE IV static analysis tool exposition workshop, which
include additional test cases, among them new large synthetic cases. In detecting weak or
vulnerable code, whether source or binary and on different platforms, the machine learning
approach proved to be fast and accurate for tasks where other tools are either much slower or
have much smaller recall of known vulnerabilities. We use signal processing and NLP techniques
to accomplish the identification and classification tasks. MARFCAT's design, from its beginning
in 2010, made it independent of the language being analyzed, be it source code, bytecode, or
binary. In this follow-up work we explore some preliminary results in this

1 Introduction

This is a follow-up to the first incarnation of MARFCAT detailed in [Mok10d, Mok11]. Thus,
most of the content here addresses the newer iteration and duplicates only the necessary
background and methodology information, in reduced form. The reader is referred to the
expanded background information and results in that previous work, which is freely accessible
online (its arXiv version is still occasionally updated).

We elaborate on the details of the expanded methodology and the corresponding results of
applying machine learning techniques, along with signal and NLP processing, to static source
and binary code analysis in search of weaknesses and vulnerabilities. We use the tool, named
MARFCAT, a MARF-based Code Analysis Tool [Mok12], first exhibited at the Static Analysis Tool
Exposition (SATE) workshop in 2010 [ODBN10], to machine-learn from the Common Vulnerabilities
and Exposures (CVE)-based vulnerable cases as well as the synthetic CWE-based cases, and then
to verify the fixed versions as well as non-CVE-based cases from projects written in the same
programming languages. The second iteration of this work was prepared for SATE IV [ODBN12]
and uses its updated data set and application.

On the NLP side, we employ simple classical NLP techniques (n-grams and various smoothing 
algorithms), also combined with machine learning for novel non-NLP applications of detection, 
classification, and reporting of weaknesses related to vulnerabilities or bad coding practices 
found in artificial constrained languages, such as programming languages and their compiled 
counterparts. We compare and contrast the NLP approach to the signal processing approach in
our results summary and illustrate concrete results for the same test cases.

We claim that the presented machine learning approach is novel and highly beneficial in
static analysis and routine testing of any kind of code, including source code and binary
deployments, because of its speed, relatively high precision, and robustness, and because it
complements other approaches that do in-depth semantic analysis by prioritizing those tools'
targets. All of this can be used in an automatic manner in distributed, scalable, and diverse
environments to ensure code safety, especially of mission-critical software in all kinds of
systems. It uses spectral, acoustic, and language models to learn and classify such code.

This document, like its predecessor, is a "rolling draft" with several updates expected as
the project progresses beyond SATE IV. It is accompanied by updates to the open-source
MARFCAT tool itself [Mok12].

Organization 

The related work, on which some of the present methodology is based, is referenced in
Section 2. The methodology summary is in Section 4. We present some of the results from the
SAMATE reference test data set in Section 5. We then present a brief summary, a description
of the limitations of the current realization of the approach, and concluding remarks in
Section 6. The Appendix contains classification result tables for specific test cases,
illustrating the top results by precision.

2 Related Work 

To our knowledge, this was the first time a machine learning approach was attempted for
static code analysis, with the first results demonstrated during the SATE2010 workshop
[Mok10d, Mok12, ODBN10]. In the same year, a somewhat similar approach to vulnerability
classification and prediction using machine learning and SVMs was independently presented in
[BSSV10], but working with a different set of data.

Additional related work, of varying degrees of relevance or use, is listed below (the list is
not exhaustive). A taxonomy of Linux kernel vulnerability solutions in terms of patches and
source code, as well as categories for both, is found in [MLB07]. The core ideas and
principles behind MARF's pipeline and testing methodology for the various algorithms in the
pipeline, adapted to this case, are found in [Mok08b, Mok10b]; MARF was the easiest
implementation available to accomplish the task. The same sources describe the majority of
the core options used to configure the pipeline in terms of the algorithms used. A binary
analysis using a machine learning approach for quick scans for files of known types in a
large collection of files is described in [MD08], as are NLP and machine learning for NLP
tasks in DEFT2010 [Mok10c, Mok10a] with the corresponding DEFT2010App and its predecessor for
handwritten image processing, WriterIdentApp [MSS09]. Tlili's 2009 PhD thesis covers
automatic detection of safety and security vulnerabilities in open source software [Tli09].
Statistical analysis, ranking, approximation, dealing with uncertainty, and specification
inference in static code analysis are found in the works of Engler's team [KTB+06, KAYE04,
KE03]. Kong et al. further advance static analysis (using parsing, etc.) and specifications
to eliminate human specification from static code analysis in [KZL10]. Spectral techniques
are used for pattern scanning in malware detection by Eto et al. in [ESI+09]. Some
researchers propose a general data mining system for incident analysis with data mining
engines in [IYE+09]. Hanna et al. describe a synergy between static and dynamic analysis for
the detection of software security vulnerabilities in [HLYD09], paving the way to unify the
two analysis methods. Other researchers propose the MEDUSA system for metamorphic malware
dynamic analysis using API signatures in [NJG+10]. Some of the statistical NLP techniques we
used are described at length in [MS02]. BitBlaze (and its web counterpart, WebBlaze) are
other recent tools, developed at Berkeley, that perform fast static and dynamic binary code
analysis for vulnerabilities [Son10a, Son10b]. For wavelets, for example, Li et al.
[LjXP+09] have shown that wavelet transforms and k-means classification can be used to
quickly identify communicating applications on a network, which is relevant to our study of
code in any form, text or binary.



3 Data Sets 

We use the SAMATE data set to practically validate our approach. The SAMATE reference 
data set contains C/C++, Java, and PHP language tracks comprising CVE-selected cases as 
well as stand-alone cases and the large generated synthetic C and Java test cases (CWE-based, 
with a lot of variants of different known weaknesses). SATE IV expanded some cases from 
SATE2010 by increasing the version number, and dropped some other cases (e.g. Chrome). 

The C/C++ and Java test cases of various client and server OSS software are compilable into
binary and object code, while the synthetic C and Java cases generated for various CWE
entries (also compilable) allow for testing at greater scale. Each CVE-selected case had a
vulnerable version of the software in question with a list of CVEs attached to it, as well
as the most known fixed version within the minor revision number. One of the goals for the
CVE-based cases is to detect the known weaknesses outlined in the CVEs using static code
analysis and also to verify whether they were really fixed in the "fixed version" [ODBN12].
The cases with known CVEs and CWEs were used as the training models described in the
methodology. The summary below is a union of the data sets from SATE2010 and SATE IV.

The preliminary list of the CVEs that the organizers expected to locate in the test cases was
collected from the NVD [NIS12a, ODBN12] for Wireshark 1.2.0, Dovecot, Tomcat 5.5.13, Jetty
6.1.16, and Wordpress 2.0.



The specific test cases, with versions and languages at the time, included the following
CVE-selected cases:

• C: Wireshark 1.2.0 (vulnerable) and Wireshark 1.2.18 (fixed, up from Wireshark 1.2.9 in
SATE2010)

• C: Dovecot (vulnerable) and Dovecot (fixed)

• C++: Chrome 5.0.375.54 (vulnerable) and Chrome 5.0.375.70 (fixed) 

• Java: Tomcat 5.5.13 (vulnerable) and Tomcat 5.5.33 (fixed, up from Tomcat 5.5.29 in 
SATE2010) 

• Java: Jetty 6.1.16 (vulnerable) and Jetty 6.1.26 (fixed) 

• PHP: Wordpress 2.0 (vulnerable) and Wordpress 2.2.3 (fixed) 

Originally non-CVE selected in SATE2010:

• C: Dovecot 

• Java: Pebble 2.5-M2 

Synthetic CWE cases produced by the SAMATE team: 

• C: Synthetic C covering 118 CWEs and ≈ 60K files

• Java: Synthetic Java covering ≈ 50 CWEs and ≈ 20K files

4 Methodology 

In this section we outline the methodology of our approach to static source code analysis.
Most of this methodology is an updated description from [Mok10d]. The line number
determination methodology is also detailed in [Mok10d, ODBN10], but is not replicated here.
The overview of the methodology's principles is in Section 4.1, the knowledge base
construction is in Section 4.2, the machine learning categories are in Section 4.3, and the
high-level algorithmic description is in Section 4.4.



4.1 Methodology Overview 

The core methodology principles include: 

• Machine learning and dynamic programming 

• Spectral and signal processing techniques 

• NLP n-gram and smoothing techniques (add-δ, Witten-Bell, MLE, etc.)

We use signal processing techniques; that is, presently we do not parse or otherwise work at
the syntax and semantics levels. We treat the source code as a "signal", equivalent to
binary, where each n-gram (n = 2 presently, i.e. two consecutive characters or, more
generally, bytes) is used to construct a sample amplitude value in the signal. In the NLP
pipeline, we similarly treat the source code as "characters", where each n-gram (n = 1..3) is
used to construct the language model.
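To make the loading step concrete, the following is a minimal sketch of reading a file's
bytes as a signal. It is not MARFCAT's actual loader (MARF is written in Java; Python is used
here only for brevity). It assumes non-overlapping bigrams, big-endian signed 16-bit PCM
samples, and zero-padding of odd-length input; the name load_as_signal is hypothetical.

```python
import struct


def load_as_signal(data):
    """Interpret consecutive byte pairs (bigrams) as signed 16-bit PCM samples.

    Assumptions (not taken from the paper): bigrams do not overlap, the byte
    order is big-endian, and odd-length input is padded with one zero byte.
    Returns amplitudes normalized to the range [-1.0, 1.0).
    """
    if len(data) % 2:
        data += b"\x00"
    count = len(data) // 2
    samples = struct.unpack(">%dh" % count, data)  # 'h' = signed 16-bit integer
    return [sample / 32768.0 for sample in samples]


if __name__ == "__main__":
    # A source line is treated exactly like any other byte sequence.
    print(load_as_signal(b"int x;\n"))
```

Because the loader never looks at syntax, the same function applies unchanged to source
text, bytecode, or object code.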

We show the system examples of files with weaknesses, and MARFCAT learns them by computing
spectral signatures using signal processing techniques or various language models (based on
options) from CVE-selected test cases. When some of the mentioned techniques are applied
(e.g. filters, silence/noise removal, other preprocessing and feature extraction techniques),
the line number information is lost as a part of this process.

When we test, we either compute how similar to or distant from the known trained-on
weakness-laden files each file is, or, in the NLP pipeline, compare the trained language
models with the unseen language fragments. In part, the methodology is roughly analogous to
how signature-based antivirus or IDS software detects bad signatures, except that, with a
large number of machine learning and signal processing algorithms available, we test to find
out which combination gives the highest precision and the best run-time.
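The distance-based comparison described above can be sketched as a nearest-signature
ranking. This is only an illustration under stated assumptions: the signature dictionary,
the labels, and the function names are hypothetical, and Chebyshev distance is just one of
the distance measures that could be plugged in.

```python
def chebyshev(a, b):
    """Chebyshev (maximum coordinate difference) distance between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def rank_signatures(features, signatures, distance=chebyshev):
    """Rank trained signatures from the closest to the farthest.

    `signatures` maps a class label (e.g. a CVE or CWE identifier) to the
    feature vector learned for it; this layout is an assumption made here.
    """
    ranked = [(label, distance(features, vector))
              for label, vector in signatures.items()]
    ranked.sort(key=lambda pair: pair[1])
    return ranked


if __name__ == "__main__":
    trained = {"CVE-A": [0.0, 0.0], "CVE-B": [1.0, 1.0]}  # made-up signatures
    print(rank_signatures([0.1, 0.2], trained))
```

Swapping the `distance` argument is how different classifier choices could be compared for
precision and run-time in such a sketch.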

At present, however, we look at whole files instead of parsing the finer-grained details of
patches and weak code fragments. This lowers the precision, but makes scanning all the code
files relatively fast.



4.2 CVEs and CWEs - the Knowledge Base 

The CVE-selected test cases serve as a source for the knowledge base, from which we gather
information on what known weak code "looks like" in signal form [Mok10d]; we store this
information as spectral signatures clustered per CVE or CWE (Common Weakness Enumeration).
The large synthetic code base with CWEs introduced by the SAMATE team serves as a part of
knowledge base learning as well. Thus, we:

• Teach the system from the CVE-based cases 

• Test on the CVE-based cases 

• Test on the non-CVE-based cases 

For synthetic cases we do similarly: 

• Teach the system from the CWE-based synthetic cases 

• Test on the CWE-based synthetic cases 

• Test on the CVE and non-CVE-based cases for CWEs from synthetic cases 



We create index files in XML, in a format similar to that of SATE, to index all the files of
the test case under study. After the initial index generation, the CVE-based cases are
manually annotated from the NVD database before being fed to the system. The script that does
the initial index gathering in the OSS distribution of MARFCAT is called
collect-files-meta.pl and is written in Perl. The synthetic cases required a special
modification of that script, resulting in collect-files-meta-synthetic.pl, where there are no
CVEs to fill in but CWEs alone, with auto-prefilled explanations, since the information in
the synthetic cases is not arbitrary and is controlled for identification.



4.3 Categories for Machine Learning 

The two primary groups of classes we train and test on are naturally the CVEs [NIS12a,
NIS12b] and CWEs [VM12]. The advantage of CVEs is their precision, and the associated meta
knowledge from [NIS12a, NIS12b] can all be aggregated and used to scan successive versions of
the same software or derived products (e.g. WebKit in multiple browsers). CVEs are also
generally uniquely mapped to CWEs. The CWEs as a primary class, however, offer broader
categories of the kinds of weaknesses there may be, but are not yet well assigned to and
associated with CVEs, so we observe a loss of precision. Since we do not parse, we generally
cannot deduce weakness types or even simple-looking aspects like the line numbers where the
weak code may be. So we resort to secondary categories, which are usually tied to the first
two and which we also machine-learn along with them, such as issue types (sink, path, fix)
and line numbers.

4.4 Algorithms 

In our methodology we systematically test and select the best (a tradeoff between speed and 
accuracy) combination(s) of the algorithm implementations available to us and then use only 
those for subsequent testing. This methodology is augmented with the cases when the knowledge 
base for the same code type is learned from multiple sources (e.g. several independent C test 
cases). 

4.4.1 Signal Pipeline 

Algorithmically speaking, the steps that are performed in the machine-learning signal-based
analysis are in Figure 1. The specific algorithms come from the classical literature and
other sources and are detailed in [Mok08b] and the related works. To be more specific for
this work, loading typically refers to the interpretation of the files being scanned in terms
of bytes forming amplitude values in a signal (as an example, at an 8 kHz or 16 kHz
frequency) using either a uni-gram, bi-gram, or tri-gram approach. Preprocessing can then be
none at all ("raw", the fastest), normalization, traditional frequency domain filters,
wavelet-based filters, etc. Feature extraction involves reducing an arbitrary-length signal
to a fixed-length feature vector of what are thought to be the most relevant features in the
signal (e.g. spectral features in FFT or LPC, min-max amplitudes, etc.). The classification
stage is then split into either training, by learning the incoming feature vectors (usually
k-means clusters, median clusters, or plain feature vector collection, combined with e.g.
neural network training), or testing them against the previously learned models.
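The reduction of an arbitrary-length signal to a fixed-length feature vector, followed by
training a mean cluster, can be sketched as below. This is an assumption-laden illustration
rather than MARF's implementation: it uses a naive DFT instead of an optimized FFT, a small
hypothetical window size of 8 samples, and hypothetical function names.

```python
import cmath


def spectrum(window):
    """Magnitudes of the first half of a naive DFT of one window."""
    n = len(window)
    return [
        abs(sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2)
    ]


def extract_features(signal, window_size=8):
    """Average the window spectra so that a signal of any length maps to
    exactly window_size // 2 features."""
    windows = [signal[i:i + window_size]
               for i in range(0, len(signal) - window_size + 1, window_size)]
    if not windows:  # signal shorter than one window: zero-pad it
        windows = [list(signal) + [0.0] * (window_size - len(signal))]
    spectra = [spectrum(w) for w in windows]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def mean_cluster(vectors):
    """Train one class by averaging its feature vectors element-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    short = extract_features([0.1, 0.2])
    long_ = extract_features([0.05 * (i % 7) for i in range(100)])
    print(len(short), len(long_))      # both vectors have the same length
    print(mean_cluster([short, long_]))
```

The fixed output length is what allows files of very different sizes to be compared against
the same trained clusters.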

4.4.2 NLP Pipeline 

The steps that are performed in the NLP and machine-learning based analysis are presented in
Figure 2. The specific algorithms again come from the classical literature (e.g. [MS02]) and
are detailed in [Mok10b] and the related works. To be more specific for this work, loading
typically refers to the interpretation of the files being scanned in terms of n-grams (a
uni-gram, bi-gram, or tri-gram approach) and the associated statistical smoothing algorithms,
the results of which (a vector, or a 2D or 3D matrix) are stored.
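As an illustration of the n-gram loading and smoothing just described, here is a minimal
sketch of a character n-gram model with add-δ smoothing, one of the classical estimators.
The function names, the default δ, and the vocabulary size of 256 (one symbol per byte value)
are assumptions for the example, not MARF settings.

```python
from collections import Counter


def train_ngram_model(text, n=2):
    """Count every n-gram (a sliding window of n symbols) in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def add_delta_probability(model, ngram, delta=0.5, vocabulary_size=256):
    """Add-delta estimate P(ngram) = (count + delta) / (N + delta * B),
    where N is the number of observed n-grams and B = V ** n is the number
    of possible n-grams over a vocabulary of V symbols."""
    total = sum(model.values())
    bins = vocabulary_size ** len(ngram)
    return (model[ngram] + delta) / (total + delta * bins)


if __name__ == "__main__":
    model = train_ngram_model("strcpy(buf, src);", n=2)
    print(add_delta_probability(model, "st"))   # seen bigram
    print(add_delta_probability(model, "zz"))   # unseen bigram, still non-zero
```

Smoothing matters here because an unseen file almost always contains n-grams that never
occurred in training; without it their probability would be zero.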

4.5 Binary and Bytecode Analysis 

In this iteration we also perform preliminary static analysis of Java bytecode and compiled C
code and produce results using the same signal processing and NLP techniques, combined with
machine learning and data mining. At this writing, the NIST SAMATE synthetic reference data
set for Java and C was used. The algorithms presented in Section 4.4 are used as-is in this
scenario, with modifications to the index files. The modifications include removal of the
line numbers, source code fragments, and lines-of-text counts (which are largely meaningless
and ignored). The byte counts may be recomputed, and capturing a byte offset instead of a
line number was projected. The filenames of the index files were updated to include -bin in
them to differentiate them from the original index files describing the source code. Another
point is that at the moment the simplifying assumption is that each compilable source file,
e.g. .java or .c, produces the corresponding .class or .o file that we examine. We do not
examine inner classes or linked executables or libraries at this point.

// Construct an index mapping CVEs to files and locations within files
Compile meta-XML index files from the CVE reports (line numbers, CVE, CWE, fragment size,
etc.). Partly done by a Perl script and partly annotated manually;
foreach source code base, binary code base do
    // Presently in these experiments we use simple mean clusters of feature vectors
    // or unigram language models per default MARF specification ([Mok08b, The12])
    Train the system based on the meta index files to build the knowledge base (learn);
    begin
        Load (interpret as a wave signal or n-gram);
        Preprocess (none, FFT-filters, wavelets, normalization, etc.);
        Extract features (FFT, LPC, min-max, etc.);
        Train (Similarity, Distance, Neural Network, etc.);
    end
    Test on the training data for the same case (e.g. Tomcat 5.5.13 on Tomcat 5.5.13) with
    the same annotations to make sure the results make sense by being high and deduce the
    best algorithm combinations for the task;
    begin
        Load (same);
        Preprocess (same);
        Extract features (same);
        Classify (compare to the trained k-means, or medians, or language models);
        Report;
    end
    Similarly test on the testing data for the same case (e.g. Tomcat 5.5.13 on Tomcat
    5.5.13) without the annotations as a sanity check;
    Test on the testing data for the fixed case of the same software (e.g. Tomcat 5.5.13 on
    Tomcat 5.5.33);
    Test on the testing data for the general non-CVE case (e.g. Tomcat 5.5.13 on Pebble or
    synthetic);
end

Figure 1: Machine-learning-based static code analysis testing algorithm using the signal
pipeline

4.6 Wavelets 

As a part of a collaboration project with Dr. Yankui Sun from Tsinghua University,
wavelet-based signal processing for the purpose of noise filtering is being introduced in
this work to compare it to no filtering and to classical FFT-based filtering. It has also
been shown in [LjXP+09] that wavelet-aided filtering can be used as a fast preprocessing
method for network application identification and traffic analysis [LKW08].

Compile meta-XML index files from the CVE reports (line numbers, CVE, CWE, fragment size,
etc.). Partly done by a Perl script and partly annotated manually;
foreach source code base, binary code base do
    // Presently in these experiments we use simple unigram language models
    // per default MARF specification ([Mok10b])
    Train the system based on the meta index files to build the knowledge base (learn);
    begin
        Load (n-gram);
        Train (statistical smoothing estimators);
    end
    Test on the training data for the same case (e.g. Tomcat 5.5.13 on Tomcat 5.5.13) with
    the same annotations to make sure the results make sense by being high and deduce the
    best algorithm combinations for the task;
    begin
        Load (same);
        Classify (compare to the trained language models);
        Report;
    end
    Similarly test on the testing data for the same case (e.g. Tomcat 5.5.13 on Tomcat
    5.5.13) without the annotations as a sanity check;
    Test on the testing data for the fixed case of the same software (e.g. Tomcat 5.5.13 on
    Tomcat 5.5.33);
    Test on the testing data for the general non-CVE case (e.g. Tomcat 5.5.13 on Pebble or
    synthetic);
end

Figure 2: Machine-learning-based static code analysis testing algorithm using the NLP
pipeline



We rely in part on the algorithm and methodology found in [AS01, SCL+03, KBC05, KBC06], and
at this point only a separating 1D discrete wavelet transform (SDWT) has been tested (see
Section 5.4.1).

Since the original wavelet implementation [SCL+03] is in MATLAB [Mat12a, Sch07], we used in
part the codegen tool from the MATLAB Coder toolbox [Mat12b, Mat12c] to generate a rough
C/C++ equivalent in order to (manually) translate some fragments into Java (the language of
MARF and MARFCAT). The specific function for up/down sampling used by the wavelets function
in [Mot09], also written in C/C++, was translated to Java in MARF as well, with unit tests
added.
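To illustrate the general idea of wavelet-based noise filtering as a preprocessing step, the
sketch below applies one level of a 1D Haar transform, zeroes small detail coefficients, and
reconstructs the signal. The Haar wavelet, the hard threshold, and all names are stand-ins
chosen for brevity; this is not the SDWT implementation from [SCL+03] used in MARF.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One level of the 1D Haar transform: (approximation, detail)."""
    signal = list(signal)
    if len(signal) % 2:
        signal.append(signal[-1])  # repeat the last sample to get even length
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approximation = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approximation, detail


def haar_idwt(approximation, detail):
    """Inverse of haar_dwt for one level."""
    signal = []
    for a, d in zip(approximation, detail):
        signal.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return signal


def wavelet_filter(signal, threshold=0.1):
    """Suppress small high-frequency detail (treated as noise) and rebuild."""
    approximation, detail = haar_dwt(signal)
    detail = [d if abs(d) > threshold else 0.0 for d in detail]
    return haar_idwt(approximation, detail)


if __name__ == "__main__":
    print(wavelet_filter([1.0, 1.01, 5.0, 1.0]))
```

In this sketch the nearly equal first pair is smoothed to its average, while the large jump
in the second pair survives the filter.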

4.7 Demand-Driven Distributed Evaluation with GIPSY 

To enhance the scalability of the approach, we convert the stand-alone MARFCAT application to
a distributed one using an eductive (demand-driven) model of computation implemented in the
multi-tier run-time system of the General Intensional Programming System (GIPSY) [Han10,
Ji11, Vas05, Paq09], which can be executed distributively using Jini (Apache River) or JMS
[JMP12].



To adapt the application to GIPSY's multi-tier architecture, we create problem-specific
generator and worker tiers (PS-DGT and PS-DWT respectively) for the MARFCAT application. The
generators produce demands for what needs to be computed, each in the form of a file (a
source code file or a compiled binary) to be evaluated, and deposit such demands as pending
into a store managed by the demand store tier (DST). Workers pick up pending demands from
the store and then process them (all tiers run on multiple nodes) using a traditional
MARFCAT instance. Once the result (a Warning instance) is computed, the PS-DWT deposits it
back into the store with the status set to computed. The generator "harvests" all computed
results (warnings) and produces the final report for a test case. Multiple test cases can be
evaluated simultaneously, or a single case can be evaluated distributively. This approach
helps to cope with large amounts of data and to avoid recomputing warnings that have already
been computed and cached in the DST.

The initial basic experiment assumes the PS-DWTs have the training set data and the test
cases available to them from the start (either as a copy or via NFS/CIFS-mounted volumes);
thus, as of this version the distributed evaluation concerns only the classification task.
Follow-up work will remove this limitation.

In this setup a demand represents a file (a path) to scan (actually an instance of the
FileItem object), which is deposited into the DST. The PS-DWT picks it up, checks the file
against the training set that is already there, and returns a ResultSet object back into the
DST under the same demand signature that was used to deposit the path to scan. The result set
is sorted from the most likely to the least likely, with a value corresponding to the
distance or similarity. The PS-DGT picks up the result sets, does the final output
aggregation, and saves the report in one of the desired report formats (see Section 4.8),
picking up the top two results from the result set and testing against a threshold to accept
or reject the file (path) as vulnerable or not. This effectively splits the monolithic
MARFCAT application into two halves for distributing the work, where the classification half
is arbitrarily parallel.

Simplifying assumptions:

• Test case data and training sets are present on each node (physical or virtual) in advance
(via a copy or a CIFS or NFS volume), so no demand-driven training occurs, only
classification

• The demand is assumed to contain only the information about the file to be examined
(FileItem)

• The PS-DWT assumes a single pre-defined configuration, i.e. the configuration of MARFCAT's
options is not a part of the demand

• The PS-DWT assumes CVE or CWE testing based on its local settings and not via the
configuration in a demand
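The demand flow of this section (generator deposits pending demands, workers compute results,
generator harvests them) can be sketched in a single process as follows. This is only a toy
model under the simplifying assumptions listed above: the class and function names are
hypothetical, the store is an in-memory dictionary rather than GIPSY's DST, and acceptance is
decided on the distance of the best of the top two results.

```python
PENDING, COMPUTED = "pending", "computed"


class DemandStore:
    """In-memory stand-in for the demand store tier (DST)."""

    def __init__(self):
        self._entries = {}  # demand signature (file path) -> [status, result]

    def deposit_demand(self, path):
        # A demand that is already known is not re-queued, so a result that
        # was computed earlier is reused instead of being recomputed.
        self._entries.setdefault(path, [PENDING, None])

    def next_pending(self):
        for path, (status, _) in self._entries.items():
            if status == PENDING:
                return path
        return None

    def deposit_result(self, path, result_set):
        self._entries[path] = [COMPUTED, result_set]

    def computed(self):
        return {path: result for path, (status, result)
                in self._entries.items() if status == COMPUTED}


def run_worker(store, classify):
    """Worker role: classify pending files until none are left."""
    path = store.next_pending()
    while path is not None:
        store.deposit_result(path, classify(path))
        path = store.next_pending()


def harvest(store, threshold):
    """Generator role: keep the top two results of every file whose best
    (smallest) distance is within the threshold."""
    report = {}
    for path, result_set in store.computed().items():
        top_two = result_set[:2]
        if top_two and top_two[0][1] <= threshold:
            report[path] = top_two
    return report


if __name__ == "__main__":
    store = DemandStore()
    for name in ("a.c", "b.c"):
        store.deposit_demand(name)
    fake = {"a.c": [("CWE-119", 0.1), ("CWE-20", 0.4)],   # made-up results
            "b.c": [("CWE-119", 0.9), ("CWE-20", 1.2)]}
    run_worker(store, fake.__getitem__)
    print(harvest(store, threshold=0.5))
```

Several workers could call run_worker against a shared store; in a real deployment the
store would also need to mark demands as in progress, which this sketch omits.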



4.8 Export 
4.8.1 SATE 

By default MARFCAT produces the report data in the SATE XML format, according to the SATE IV
requirements. In this iteration other formats are being considered and realized. To enable
output in multiple formats, the MARFCAT report generation data structures were adapted to
case-based output.



4.8.2 Forensic Lucid 

The first of the other formats is Forensic Lucid, the topic of the author Mokhov's PhD, a
language to specify and evaluate digital forensic cases by uniformly encoding the evidence
and witness accounts (the evidential statement or knowledge base) of any case from multiple
sources (system specs, logs, human accounts, etc.) as a description of an incident for
further investigation and event reconstruction. Following the data export in Forensic Lucid
in the preceding work [MPD08, MPD10, Mok08a], we use it as a format for evidential processing
of the results produced by MARFCAT. The work [MPD08] provides details of the language; it
will suffice to mention here that the report generated by MARFCAT in Forensic Lucid is a
collection of warnings as observations with the hierarchical notion of nested context of
warning and location information. These form an evidential statement in Forensic Lucid. An
example scenario where such evidence, compiled via a MARFCAT Forensic Lucid report, would be
used is web-based application and web browser-based incident investigation of fraud, XSS,
buffer overflows, etc., linking CVE/CWE-based evidence analysis of security bugs in the code
(binary or source) with the associated web-based malware propagation or attacks to provide
possible events where specific attacks can be traced back to specific security
vulnerabilities.

4.8.3 SAFES 

SAFES is the third output format of MARFCAT; its export functionality is not complete as of
this writing. SAFES is becoming a standard for reporting such information, and the SATE
organizers began endorsing it as an alternative during SATE IV.

4.9 Experiments 

Below is the current summary of the conducted experiments:

• Re-testing of the newer fixed versions such as Wireshark 1.2.18 and Tomcat 5.5.33.

• Half-based testing of the previous versions by reducing the training set by half but
testing for all known CVEs or CWEs for Wireshark 1.2.18, Tomcat 5.5.33, and Chrome
5.0.375.54.

• Testing the new test cases of Dovecot, Jetty 6.1.x, and Wordpress 2.x as well as Synthetic 
C and Synthetic Java. 

• Binary test on the Synthetic C and Synthetic Java test cases. 

• Performing tests using wavelets for preprocessing. 

5 Results 

The preliminary results of applying our methodology are outlined in this section. We
summarize the top precisions per test case using either signal processing or NLP processing
of the CVE-based and synthetic cases and their application to the general cases. Subsequent
sections detail some of the findings and issues of MARFCAT's result releases with different
versions. In some experiments we compare the results with the previously obtained ones
[Mok10d] where compatible and appropriate.

The results, obtained through the corresponding versions of MARFCAT as it was being designed
and developed, are currently being released gradually in an iterative manner.

5.1 Preliminary Results Summary

The results summarize the half-training-full-testing data vs. that of the regular ones
reported in [Mok10d].

• Wireshark:

- CVEs (signal): 92.68%, CWEs (signal): 86.11%,

- CVEs (NLP): 83.33%, CWEs (NLP): 58.33%

• Tomcat:

- CVEs (signal): 83.72%, CWEs (signal): 81.82%,

- CVEs (NLP): 87.88%, CWEs (NLP): 39.39%

• Chrome:

- CVEs (signal): 90.91%, CWEs (signal): 100.00%,

- CVEs (NLP): 100.00%, CWEs (NLP): 88.89%

• Dovecot:

- 14 warnings, but all appear to be quality issues or false positives

- (very hard to follow the code, severely undocumented)

• Pebble:

- none found during quick testing


• Wireshark:

- CVEs (signal): 92.68%, CWEs (signal): 86.11%,

- CVEs (NLP): 83.33%, CWEs (NLP): 58.33%

• Tomcat:

- CVEs (signal): 83.72%, CWEs (signal): 81.82%,

- CVEs (NLP): 87.88%, CWEs (NLP): 39.39%

• Dovecot 1.2.x: (ongoing as of this writing)

• Jetty: (ongoing as of this writing)

• Wordpress: (ongoing as of this writing)

• Chrome:

- CVEs (signal): 90.91%, CWEs (signal): 100.00%,

- CVEs (NLP): 100.00%, CWEs (NLP): 88.89%

• Dovecot (new, 2.x):

- 14 warnings, but all appear to be quality issues or false positives

- (very hard to follow the code, severely undocumented)

• Pebble:

- none found during quick testing

What follows are some select statistical measurements of the precision in recognizing CVEs
and CWEs under different configurations using the signal processing and NLP processing
techniques.

"Second guess" statistics are provided to test the hypothesis that, if our first estimate of
a CVE/CWE is incorrect, the next one in line is probably the correct one. Both guesses are
counted as correct if the first guess is correct.
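The counting rule for the first and second guesses can be sketched as follows. The data
layout (one ranked list of labels per tested file) and the function name are assumptions for
the example; the labels in the usage part are made up.

```python
def guess_precision(ranked_results, expected):
    """Return (first-guess precision, second-guess precision) in percent.

    `ranked_results` holds, per tested file, the class labels ranked from the
    most to the least likely; `expected` holds the true label of each file.
    A correct first guess also counts toward the second-guess statistic.
    """
    first = second = 0
    for ranking, truth in zip(ranked_results, expected):
        if ranking[:1] == [truth]:
            first += 1
            second += 1
        elif truth in ranking[1:2]:
            second += 1
    total = len(expected)
    return 100.0 * first / total, 100.0 * second / total


if __name__ == "__main__":
    rankings = [["A", "B"], ["B", "A"], ["C", "D"], ["D", "B"]]
    truths = ["A", "A", "D", "A"]
    print(guess_precision(rankings, truths))
```

By construction the second-guess precision can never be lower than the first-guess
precision; the gap between the two shows how often the correct class was the runner-up.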

A sample signal visualization in the middle of the file packet-afs.c in Wireshark 1.2.0,
vulnerable to CVE-2009-2562, is shown in wave form in Figure 3. The low "dips" represent the
text line endings coupled with a preceding character in bigrams (two bytes interpreted as
signed PCM, assumed to be encoded at 8 kHz, representing the amplitude; normalized); the
preceding characters are often semicolons, closing or opening braces, brackets, or
parentheses. Only a small fragment of roughly 300 bytes in length is shown, to make the
nature of the signal we are dealing with visually comprehensible.

Figure 4 shows three spectrograms generated for the same file, packet-afs.c. The first two
columns represent the CVE-2009-2562-vulnerable file; both versions are the same, with
enhanced contrast to show the detail. The subsequent pairs are of the same file in Wireshark
1.2.9 and Wireshark 1.2.18, where CVE-2009-2562 is no longer present. Small changes are
noticeable primarily in the bottom left and top right corners of the images, and even smaller
ones elsewhere in the images.



5.2 Version SATE-IV.1

5.2.1 Half-Training Data For Training and Full For Testing

This is one of the experiments conducted per discussion with Aurelien Delaitre and the SATE
organizers. The main idea is to test the robustness and precision of the MARFCAT approach by
artificially reducing the known weaknesses (their locations) to learn from by 50%, but
testing on the whole 100% to see how much precision degrades with such a reduction.

Only CWE-class testing is supplied for this experiment (CVE classes make little sense). Only
the first 50% of the entries were used for training for Wireshark 1.2.0, Tomcat 5.5.13, and
Chrome 5.0.375.54, while the full 100% were used to test the precision changes. The results
are below.

It should be noted that CWE classification is generally less accurate because NVD places many
entries into very broad categories such as NVD-CWE-Other and NVD-CWE-noinfo.
Additionally, since we arbitrarily picked the first 50% of the training data, some of the CWEs
were left out completely and not trained on if they fell entirely within the omitted half, so
their individual precision is necessarily 0% when tested for.
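
The effect of taking the first half of the index for training can be sketched as below. The entry layout (a list of `(file, CWE class)` pairs in index order) and the name `half_train_split` are hypothetical and serve only to show why some classes end up with no training data.

```python
def half_train_split(entries):
    """Train on the first 50% of the entries, test on all of them.

    entries: list of (file_path, cwe_class) pairs in index order
             (a hypothetical layout used for this sketch).
    Returns (train, test, unseen), where `unseen` lists the classes that
    occur only in the omitted half and therefore cannot be recognized.
    """
    half = len(entries) // 2
    train, test = entries[:half], entries
    trained_classes = {cwe for _, cwe in train}
    unseen = sorted({cwe for _, cwe in test} - trained_classes)
    return train, test, unseen
```

Any class returned in `unseen` necessarily scores 0% in the per-class statistics, regardless of the algorithm combination used.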

The archive contains the .log and the .xml files (the latter for now are in SATE format only 
with the scientific notation +E3 removed). The best reports are: 



report-cweidnoprepreprawfftcheb-wireshark-1.2.0-half-train-cwe.xml
report-cweidnoprepreprawfftdiff-wireshark-1.2.0-half-train-cwe.xml



The experiments are subdivided into regular (signal) and NLP based testing. 
Figure 3: A wave graph of a fraction of the CVE-2009-2562-vulnerable packet-afs.c in Wireshark
1.2.0 (mid-section of packet-afs.c; signal in the "time" domain (length); x-axis: bigrams (file length),
shown from about 4800 to 5150)



Signal. 



• Wireshark 1.2.0: 



Reduction of the training data by half resulted in a ≈ 14% precision drop compared to the
previous result (best 86.11%, see the NIST report [Mok11], vs. 72.22% overall).



New results (by algorithms, then by CWEs): 






MARFCAT: A MARF Approach to SATE IV 



Mokhov, Paquet, Debbabi, Sun 




Figure 4: Spectrograms of the CVE-2009-2562-vulnerable packet-afs.c in Wireshark 1.2.0, and of the fixed
packet-afs.c in Wireshark 1.2.9 and Wireshark 1.2.18
• Tomcat 5.5.13:



The top result dropped from 81.82% (see the NIST report's Table 7, p. 70) to 75% (about
7 points) as a result of the training data reduction by 50%.



New precision estimates: 
• Chrome 5.0.375.54:



The Chrome result is included for completeness even though it is not a test case for SATE IV.



The Chrome result is poor for some reason: a drop from 100% (Table 5, p. 68) to 44.44%, although it covers only
9 entries. The first result below is invalid, i.e. with a poor recall (its good and bad counts sum to less than 9,
whereas the total should be 9; the cause has not yet been investigated).
NLP. Generally, this genre of classification was poor in this experiment, as before, with precision all around
40-45%, but:

• Wireshark 1.2.0:



New results (by algorithms, then by CWEs):
• Tomcat 5.5.13:



Strangely, the best result is higher than with all of the data in the past report (42.42%
below vs. the previous 39.39%).



guess  run  algorithms                                   good  bad  %
1st    1    -cweid -nopreprep -char -unigram -add-delta  14    19   42.42
2nd    1    -cweid -nopreprep -char -unigram -add-delta  18    15   54.55

guess  run  class    good  bad  %
1st    1    CWE-255  1     0    100.00
1st    2    CWE-264  2     0    100.00
1st    3    CWE-119  1     0    100.00
1st    4    CWE-20   1     0    100.00
1st    5    CWE-22   7     9    43.75
1st    6    CWE-200  1     3    25.00
1st    7    CWE-79   1     6    14.29
1st    8    CWE-16   0     1    0.00
2nd    1    CWE-255  1     0    100.00
2nd    2    CWE-264  2     0    100.00
2nd    3    CWE-119  1     0    100.00
2nd    4    CWE-20   1     0    100.00
2nd    5    CWE-22   11    5    68.75
2nd    6    CWE-200  1     3    25.00
2nd    7    CWE-79   1     6    14.29
2nd    8    CWE-16   0     1    0.00
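
As a consistency check, the per-class rows of the Tomcat 5.5.13 NLP result can be summed to reproduce the overall row for the algorithm combination. The sketch below is illustrative; the function name `overall_precision` and the dictionary layout are assumptions, while the counts are the 1st- and 2nd-guess figures reported for this experiment.

```python
def overall_precision(per_class):
    """Aggregate per-class (good, bad) counts into (good, bad, percent)."""
    good = sum(g for g, _ in per_class.values())
    bad = sum(b for _, b in per_class.values())
    return good, bad, round(100.0 * good / (good + bad), 2)


# Per-class (good, bad) counts for Tomcat 5.5.13, NLP, half-training run.
first_guess = {
    "CWE-255": (1, 0), "CWE-264": (2, 0), "CWE-119": (1, 0), "CWE-20": (1, 0),
    "CWE-22": (7, 9), "CWE-200": (1, 3), "CWE-79": (1, 6), "CWE-16": (0, 1),
}
# Only CWE-22 changes between the first and the second guess.
second_guess = dict(first_guess, **{"CWE-22": (11, 5)})

print(overall_precision(first_guess))   # (14, 19, 42.42)
print(overall_precision(second_guess))  # (18, 15, 54.55)
```

The aggregate is dominated by CWE-22, which holds 16 of the 33 entries, so the improvement in the second guess comes entirely from that class.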



• Chrome 5.0.375.54: 



Here the drop is twice as much (≈ 44% vs. 88%).
5.3 Version SATE-IV.2

These runs use the same SATE2010 training data for Tomcat 5.5.13 and Wireshark 1.2.0
to test the fixed versions updated since SATE2010, Tomcat 5.5.33 and Wireshark 1.2.18,
using the same settings. In this run, no new CVEs that may have appeared since the previous
fixed versions of Tomcat 5.5.29 and Wireshark 1.2.9, respectively, in 2010 were added to the
training data for the versions being tested, in order to see whether any old issues reoccur.
In short, both signal and NLP testing found none of the same known issues.

• SATE-IV.2-train-test-test-run-quick-tomcat-5-5-33-cve
This is CVE-based classical signal classification.

A typical MARFCAT run. Tomcat 5.5.13 was used for training. For most reports, no warnings
were spotted based on what was learned from 5.5.13, so the reports convey that the earlier CVEs
were fixed.

Empty reports look like:

report-noprepreprawfftcheb-train-test-test-run-quick-tomcat-5-5-33-cve.xml

However, the -cos report is noisy and non-empty:

report-noprepreprawfftcos-train-test-test-run-quick-tomcat-5-5-33-cve.xml

Overly detailed log files are also provided.

• SATE-IV.2-train-test-test-run-quick-tomcat-5-5-33-cwe
This is classical CWE-based testing.

A typical MARFCAT CWE run. Tomcat 5.5.13 was used for training.
No warnings were found based on the CVE data learned.
Most of the reports are empty, e.g.:

report-nopreprepcharunigramadddelta-train-test-test-run-quick-tomcat-5-5-33-cve-nlp.xml

The -cos report is not as noisy as for CVEs, but still contains a couple of false positives:

report-cweidnoprepreprawfftcos-train-test-test-run-quick-tomcat-5-5-33-cwe.xml

Overly detailed log files are also provided.

Training and testing indexes are provided (*_test.xml and *_train.xml).

• SATE-IV.2-train-test-test-run-quick-tomcat-5-5-33-cve-nlp
This is CVE-based NLP testing.

A typical MARFCAT NLP run. Tomcat 5.5.13 was used for training. This is usually a slow run, so
only one configuration is tried. No warnings were found based on the CVE data learned.

The only empty report is:

report-nopreprepcharunigramadddelta-train-test-test-run-quick-tomcat-5-5-33-cve-nlp.xml

However, the -cos report is noisy and non-empty:

report-noprepreprawfftcos-train-test-test-run-quick-tomcat-5-5-33-cve.xml

Overly detailed log files are also provided.

Training and testing indexes are provided (*_test.xml and *_train.xml).

• SATE-IV.2-train-test-test-run-quick-tomcat-5-5-33-cwe-nlp
This is CWE-based NLP testing.

A typical MARFCAT CWE NLP run. Tomcat 5.5.13 was used for training. This is usually a slow
run, so only one configuration is tried. No warnings were found based on the CVE data learned.

The only empty report is:

report-cweidnopreprepcharunigramadddelta-train-test-test-run-quick-tomcat-5-5-33-cwe-nlp.xml

Overly detailed log files are also provided.

• SATE-IV.2-train-test-test-run-quick-wireshark-1-2-18-cve

Tests Wireshark 1.2.18 using the training data from Wireshark 1.2.0 and classical CVE-based
processing.

The majority of the algorithms returned empty reports; -cos was as noisy as usual, while the -mink report
was non-empty but quite short (though it is also presumed to contain false positives).

Empty reports:

report-noprepreprawfftcheb-train-test-test-run-quick-wireshark-1-2-18-cve.xml
report-noprepreprawfftdiff-train-test-test-run-quick-wireshark-1-2-18-cve.xml
report-noprepreprawffteucl-train-test-test-run-quick-wireshark-1-2-18-cve.xml
report-noprepreprawffthamming-train-test-test-run-quick-wireshark-1-2-18-cve.xml

Non-empty reports:

report-noprepreprawfftcos-train-test-test-run-quick-wireshark-1-2-18-cve.xml
report-noprepreprawfftmink-train-test-test-run-quick-wireshark-1-2-18-cve.xml

Verbose log files and input index files are also supplied for most cases.
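
The report file names listed in this section appear to concatenate the configuration options with their dashes removed, followed by the run name. The helper below reproduces that pattern; it is a hypothetical sketch inferred from the names above (`report_name` is not a MARFCAT function), intended only to make the mapping from options to files easier to follow.

```python
def report_name(options, run_suffix):
    """Build a report file name from a list of command-line style options.

    options:    e.g. ["-nopreprep", "-raw", "-fft", "-cheb"]
    run_suffix: e.g. "train-test-test-run-quick-wireshark-1-2-18-cve"
    The naming rule is an assumption inferred from the listed report names.
    """
    # Drop every dash, so "-add-delta" becomes "adddelta".
    config = "".join(option.replace("-", "") for option in options)
    return "report-%s-%s.xml" % (config, run_suffix)
```

For example, the options `-nopreprep -raw -fft -cheb` with the Wireshark 1.2.18 CVE run name yield `report-noprepreprawfftcheb-train-test-test-run-quick-wireshark-1-2-18-cve.xml`.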

[TODO] 

5.4 Version SATE-IV.5 
5.4.1 Wavelet Experiments 

The preliminary experiments using the separating discrete wavelet transform (DWT) filter are
summarized in Table 1 and Table 3 for CVEs and CWEs, respectively. For comparison, the
low-pass FFT filter is used for the same data, as shown in Table 2 and Table 4, respectively. For the
CVE experiments, the wavelet transform overall produces better precision across configurations
(a larger number of configurations produce a higher precision result) than the low-pass
FFT filter. While the top precision result remains the same, the results show that when filtering is
wanted, the wavelet transform is perhaps a better choice for some configurations, e.g. from configuration 4
downward, as well as for the 2nd guess statistics. For the CWE-based processing, the very top result
of the separating DWT so far exceeds that of the low-pass FFT, but the precision then drops
below it for the subsequent configurations. -cos was dropped from Table 3 for technical reasons.
Figure 5 shows a spectrogram with the SDWT preprocessing in the pipeline. More exploration
in this area is under way with more advanced wavelet filters than the simple separating DWT
filter, to see whether they would outperform -raw or not while minimizing
the run-time performance decrease caused by the extra filtering.
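
For readers unfamiliar with wavelet filtering, the following sketch shows one level of a discrete wavelet transform that separates a signal into a low-frequency (approximation) part and a high-frequency (detail) part. It uses the Haar wavelet, which is an assumption made for simplicity; the actual filter behind the -sdwt option may use a different wavelet and a different number of levels.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail) lists."""
    if len(signal) % 2:
        # Repeat the last sample so that the samples pair up evenly.
        signal = list(signal) + [signal[-1]]
    scale = 1.0 / math.sqrt(2.0)
    evens, odds = signal[0::2], signal[1::2]
    approximation = [(a + b) * scale for a, b in zip(evens, odds)]
    detail = [(a - b) * scale for a, b in zip(evens, odds)]
    return approximation, detail
```

Keeping only the approximation coefficients acts as a crude low-pass filter, while the detail coefficients capture rapid changes between neighbouring samples, such as the line-ending "dips" in a source-code signal.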



Figure 5: A spectrogram of the CVE-2009-2562-vulnerable packet-afs.c in Wireshark 1.2.0, after
SDWT



6 Conclusion 



We review the current results of this experimental work, its current shortcomings, advantages, 
and practical implications. 



6.1 Shortcomings 

Below is a list of the most prominent issues with the presented approach. Some of them are
more "permanent", while others are solvable and intended to be addressed in future work.
Specifically:



• Looking at a signal is less intuitive visually for code analysis by humans. (However, it can
produce a problematic "spectrogram" in some cases.)

• Line numbers are a problem (easily "filtered out" as high-frequency "noise", etc.). A
whole "relativistic" and machine learning methodology was developed for the line numbers
in [Mok10d] to compensate for that. Generally, when CVEs are the primary class, by
accurately identifying the CVE number one can get all the other pertinent details from
the CVE database, including patches and line numbers, making this a lesser issue.

• Accuracy depends on the quality of the knowledge base (see Section 4.2) collected. Some of
this collection and annotation is manual to get the indexes right and, hence, error-prone.
"Garbage in, garbage out."

• Detecting more of the useful CVE or CWE signatures in non-CVE and non-CWE cases
requires large knowledge bases (human-intensive to collect), which can perhaps be shared
by different vendors via a common format, such as SATE, SAFES, or Forensic Lucid.

• No path tracing (since no parsing is present); no slicing, semantic annotations, context,
locality of reference, etc. The "sink", "path", and "fix" results in the reports also have to
be machine-learned.

• There are a lot of algorithms and their combinations to try (currently ≈ 1800 permutations) to get
the best top N. This is, however, also an advantage of the approach, as the underlying
framework can quickly allow for such testing.

• File-level training vs. fragment-level training: presently the classes are trained based
on the entire file where weaknesses are found instead of the known file fragments from
CVE-reported patches. The latter would be more fine-grained and precise than whole-file
classification, but slower. However, overall the file-level processing is more of a man-hour limitation
than a technological one.

• The separating wavelet filter rather adversely affects the precision, reducing it to low levels.

• No nice GUI. Presently the application is script/command-line based.

6.2 Advantages 

There are some key advantages of the approach presented. Some of them follow: 



• Relatively fast (e.g. Wireshark's ~ 2400 files train and test in about 3 minutes) on a 
now-commodity desktop or a laptop. 

• Language-independent (no parsing): given enough examples, it can apply to any language,
i.e. the methodology is the same no matter whether C, C++, Java, or any other source or binary
languages (PHP, C#, VB, Perl, bytecode, assembly, etc.) are used.

• Can automatically learn a large knowledge base to test on known and unknown cases. 

• Can be used to quickly pre-scan projects for further analysis by humans or other tools 
that do in-depth semantic analysis as a means to prioritize. 

• Can learn from SATE'08, SATE'09, SATE'10, and SATE IV reports. 

• Generally, high precision (and recall) in CVE and CWE detection, even at the file level. 

• A lot of algorithms and their combinations to select the best for a particular task or class 
(see Section 4.3). 

• Can cope with altered code or code used in other projects (e.g. a lot of problems in Chrome
were found in WebKit, used by several browsers).
6.3 Practical Implications

Most practical implications of all static code analyzers are obvious: to detect source
code weaknesses and report them appropriately to the developers. We outline the additional
implications this approach brings to the arsenal below:



• The approach can be used on any target language without modifications to the methodology
or knowledge of the syntax of the language. Thus, it scales to the analysis of any popular or new
language with a very small amount of effort.

• The approach can nearly identically be transposed onto compiled binaries and bytecode,
detecting vulnerable deployments and installations. This is somewhat like virus scanning of
binaries, but instead of scanning for infected binaries, one would scan for security-weak binaries
in site deployments to alert system administrators to upgrade their packages.
The experiments in this area are ongoing.

• Can learn from binary signatures from other tools like Snort [Sou12].

• The approach is easily extendable to the embedded code and mission-critical code found
in aircraft, spacecraft, and various autonomous systems.



6.4 Future Work 

There is a great number of possibilities for future work. This includes improvements to
the code base of MARFCAT as well as resolving unfinished scenarios and results, addressing
the shortcomings in Section 6.1, testing more algorithms and combinations from the related work,
and moving on to other programming languages (e.g. ASP, C#). Furthermore, we plan to foster
collaboration with academic, industry, and government partners and with vendors such as VeraCode,
Coverity, and others who may have vast data sets, to test the full potential of the approach with
them and with the community as a whole. We then plan to move on to dynamic code analysis as well,
applying similar techniques there. Other near-future work items include realization of the SVM-based
classification, data export in the SAFES and Forensic Lucid formats, a number of wavelet filtering
improvements, and distributed GIPSY cluster-based evaluation.

To improve detection and classification of malware in network traffic or otherwise,
we employ a machine learning approach to static pcap payload malicious code analysis and
fingerprinting using the open-source MARF framework and its MARFCAT application, originally
designed for the SATE static analysis tool exposition workshop. We first train on the known
malware pcap data and measure the precision, and then test on the unseen, but known, data
and select the best available machine learning combination to do so. This work elaborates
on the details of the methodology and the corresponding results of the application of machine
learning techniques, along with signal processing and NLP alike, to static network packet
analysis in search of malicious code in the packet capture (pcap) data; for malicious code analysis, see
[BOB+10, SEZS01, SXC+04, HJ07, HRSS07, Sue07, RM08, BOA+07]. We show the system
examples of pcap files with malware, and MARFCAT learns them by computing spectral
signatures using signal processing techniques. When we test, we compute how similar or distant
each file is from the known trained-on malware-laden files. In part, the methodology can
approximately be seen as similar to how some signature-based "antivirus" or IDS software systems detect bad
signatures, except that, with a large number of machine learning and signal processing algorithms,
we test to find out which combination gives the highest precision and best run-time. At
present, however, we are looking at whole pcap files. This aspect lowers the precision, but
makes it fast to scan all the files. The malware database with known malware, the reports, etc. serves
as a knowledge base to machine-learn from. Thus, we primarily:

• Teach the system from the known cases of malware from their pcap data 

• Test on the known cases 

• Test on the unseen cases 
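
The teach-then-test cycle listed above can be sketched as a nearest-signature classifier. In this illustration each class is represented by the mean of its training feature vectors and test vectors are ranked by Chebyshev distance; the mean-vector representation and the function names `train` and `rank` are assumptions of the sketch, not a description of MARF's internals.

```python
def chebyshev(a, b):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def train(samples):
    """samples: dict mapping a class label to a list of equal-length
    feature vectors. Returns one mean vector (signature) per label."""
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in samples.items()
    }


def rank(model, features, distance=chebyshev):
    """Return the class labels ordered from the nearest signature to the farthest."""
    return sorted(model, key=lambda label: distance(model[label], features))
```

The first element of the ranking is the first guess and the second element is the second guess; swapping `distance` for another measure corresponds to trying a different classifier option.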
A Classification Result Tables

What follows are result tables with the top classification results ranked from the most precise at the
top. These include the configuration settings for MARF by means of options (the algorithm
implementations are at their defaults [Mok07]).
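
The ranking used in the result tables can be reproduced from the good and bad counts alone. The sketch below sorts configurations by precision, using the 1st-guess counts reported for the low-pass FFT CWE experiment (Table 4); the function name `rank_configurations` is hypothetical.

```python
def rank_configurations(results):
    """results: list of (options, good, bad) tuples.
    Returns (options, percent) pairs, most precise first; ties keep input order."""
    scored = [
        (options, round(100.0 * good / (good + bad), 2))
        for options, good, bad in results
    ]
    return sorted(scored, key=lambda item: -item[1])


table4_first_guess = [
    ("-cweid -nopreprep -low -fft -hamming", 12, 24),
    ("-cweid -nopreprep -low -fft -diff", 30, 6),
    ("-cweid -nopreprep -low -fft -cos", 36, 40),
    ("-cweid -nopreprep -low -fft -cheb", 30, 6),
    ("-cweid -nopreprep -low -fft -eucl", 25, 11),
    ("-cweid -nopreprep -low -fft -mink", 20, 16),
]

for options, percent in rank_configurations(table4_first_guess):
    print("%-40s %6.2f" % (options, percent))
```

Note that the -cos rows report more than 36 entries in total, so their percentages are computed over a larger denominator than those of the other configurations.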



B Forensic Lucid Report Example 

An example report encoding the reported data in Forensic Lucid for Wireshark 1.2.0 after using
simple FFT-based feature extraction and Chebyshev distance as a classifier. The report provides
the same data, compressed, as the SATE XML, but in the Forensic Lucid syntax for automated
reasoning and event reconstruction during a digital investigation. The example is an evidential
statement context encoded for use in the investigator's knowledge base of a particular case.

#FORENSICLUCID

evidential statement report_marfcat_0_0_2_SATE_IV_4
{

  weakness_1 @ [id:1, tool_specific_id:1, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
  where
    dimension id, tool_specific_id, cweid, cwename;

    observation sequence weakness_1 = (locations_wk_1, 1, 0, 1.0);

    locations_wk_1 = locations @ [tool_specific_id:1, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_340 ([line => 828, path => "wireshark-1.2.0/epan/dissectors/packet-afs.c")
    observation location_id_340 ([line => 1718, path => "wireshark-1.2.0/epan/dissectors/packet-afs.c")
    observation location_id_340 ([line => 1729, path => "wireshark-1.2.0/epan/dissectors/packet-afs.c")
    observation location_id_340 ([line => 1740, path => "wireshark-1.2.0/epan/dissectors/packet-afs.c")
    observation location_id_340 ([line => 1747, path => "wireshark-1.2.0/epan/dissectors/packet-afs.c")
    textoutput = "";

    observation grade = ([severity => 5, tool_specific_rank => 0.0], 1, 0, 1.0);

  end;

  weakness_2 @ [id:2, tool_specific_id:2, cweid:119, cwename:"Buffer Errors (CWE119)"]
  where
    dimension id, tool_specific_id, cweid, cwename;

    observation sequence weakness_2 = (locations_wk_2, 1, 0, 1.0);

    locations_wk_2 = locations @ [tool_specific_id:2, cweid:119, cwename:"Buffer Errors (CWE119)"];
    observation location_id_411 ([line => 830, path => "wireshark-1.2.0/epan/dissectors/packet-ber.c")
    observation location_id_411 ([line => 861, path => "wireshark-1.2.0/epan/dissectors/packet-ber.c")
    observation location_id_411 ([line => 885, path => "wireshark-1.2.0/epan/dissectors/packet-ber.c")
    textoutput = "";

    observation grade = ([severity => 1, tool_specific_rank => 0.0], 1, 0, 1.0);

  end;

  weakness_3 @ [id:3, tool_specific_id:3, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
  where
    dimension id, tool_specific_id, cweid, cwename;

    observation sequence weakness_3 = (locations_wk_3, 1, 0, 0.004878625561362933);

    locations_wk_3 = locations @ [tool_specific_id:3, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_433 ([line => 669, path => "wireshark-1.2.0/epan/dissectors/packet-btl2cap.c")
    textoutput = "";

    observation grade = ([severity => 5, tool_specific_rank => 204.97576364943077], 1, 0, 0.004878625561362933);

  end;

  weakness_4 @ [id:4, tool_specific_id:4, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
  where
    dimension id, tool_specific_id, cweid, cwename;

    observation sequence weakness_4 = (locations_wk_4, 1, 0, 1.0);

    locations_wk_4 = locations @ [tool_specific_id:4, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_550 ([line => 248, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
    observation location_id_550 ([line => 252, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
Table 1: CVE Stats for Wireshark 1.2.0, Separating DWT Wavelet Filter Preprocessing

guess  run  algorithms                                                  good  bad  %
1st    1    -nopreprep -sdwt -fft -diff -spectrogram -graph -flucid     37    4    90.24
1st    2    -nopreprep -sdwt -fft -cheb -spectrogram -graph -flucid     37    4    90.24
1st    3    -nopreprep -sdwt -fft -eucl -spectrogram -graph -flucid     27    14   65.85
1st    4    -nopreprep -sdwt -fft -hamming -spectrogram -graph -flucid  26    15   63.41
1st    5    -nopreprep -sdwt -fft -mink -spectrogram -graph -flucid     22    19   53.66
1st    6    -nopreprep -sdwt -fft -cos -spectrogram -graph -flucid      38    65   36.89
2nd    1    -nopreprep -sdwt -fft -diff -spectrogram -graph -flucid     39    2    95.12
2nd    2    -nopreprep -sdwt -fft -cheb -spectrogram -graph -flucid     39    2    95.12
2nd    3    -nopreprep -sdwt -fft -eucl -spectrogram -graph -flucid     35    6    85.37
2nd    4    -nopreprep -sdwt -fft -hamming -spectrogram -graph -flucid  29    12   70.73
2nd    5    -nopreprep -sdwt -fft -mink -spectrogram -graph -flucid     31    10   75.61
2nd    6    -nopreprep -sdwt -fft -cos -spectrogram -graph -flucid      39    64   37.86

guess  run  class          good  bad  %
1st    1    CVE-2009-3829  6     0    100.00
1st    2    CVE-2009-2562  6     0    100.00
1st    3    CVE-2009-4378  6     0    100.00
1st    4    CVE-2010-2286  6     0    100.00
1st    5    CVE-2010-0304  6     0    100.00
1st    6    CVE-2009-4376  6     0    100.00
1st    7    CVE-2010-2283  6     0    100.00
1st    8    CVE-2009-3551  6     0    100.00
1st    9    CVE-2009-3550  6     0    100.00
1st    10   CVE-2009-3549  6     0    100.00
1st    11   CVE-2009-2563  6     2    75.00
1st    12   CVE-2009-2560  11    4    73.33
1st    13   CVE-2009-3241  15    9    62.50
1st    14   CVE-2010-1455  31    23   57.41
1st    15   CVE-2009-2561  6     6    50.00
1st    16   CVE-2010-2287  6     6    50.00
1st    17   CVE-2009-2559  6     6    50.00
1st    18   CVE-2009-3243  16    16   50.00
1st    19   CVE-2010-2285  6     7    46.15
1st    20   CVE-2009-4377  12    16   42.86
1st    21   CVE-2010-2284  6     9    40.00
1st    22   CVE-2009-3242  6     17   26.09
2nd    1    CVE-2009-3829  6     0    100.00
2nd    2    CVE-2009-2562  6     0    100.00
2nd    3    CVE-2009-4378  6     0    100.00
2nd    4    CVE-2010-2286  6     0    100.00
2nd    5    CVE-2010-0304  6     0    100.00
2nd    6    CVE-2009-4376  6     0    100.00
2nd    7    CVE-2010-2283  6     0    100.00
2nd    8    CVE-2009-3551  6     0    100.00
2nd    9    CVE-2009-3550  6     0    100.00
2nd    10   CVE-2009-3549  6     0    100.00
2nd    11   CVE-2009-2563  6     2    75.00
2nd    12   CVE-2009-2560  12    3    80.00
2nd    13   CVE-2009-3241  16    8    66.67
2nd    14   CVE-2010-1455  43    11   79.63
2nd    15   CVE-2009-2561  6     6    50.00
2nd    16   CVE-2010-2287  12    0    100.00
2nd    17   CVE-2009-2559  6     6    50.00
2nd    18   CVE-2009-3243  19    13   59.38
2nd    19   CVE-2010-2285  6     7    46.15
2nd    20   CVE-2009-4377  12    16   42.86
2nd    21   CVE-2010-2284  6     9    40.00
2nd    22   CVE-2009-3242  8     15   34.78
Table 2: CVE Stats for Wireshark 1.2.0, Low-Pass FFT Filter Preprocessing

guess  run  algorithms                             good  bad  %
1st    1    -nopreprep -low -fft -cheb -flucid     37    4    90.24
1st    2    -nopreprep -low -fft -diff -flucid     37    4    90.24
1st    3    -nopreprep -low -fft -eucl -flucid     27    14   65.85
1st    4    -nopreprep -low -fft -hamming -flucid  23    18   56.10
1st    5    -nopreprep -low -fft -mink -flucid     22    19   53.66
1st    6    -nopreprep -low -fft -cos -flucid      36    114  24.00
2nd    1    -nopreprep -low -fft -cheb -flucid     38    3    92.68
2nd    2    -nopreprep -low -fft -diff -flucid     38    3    92.68
2nd    3    -nopreprep -low -fft -eucl -flucid     34    7    82.93
2nd    4    -nopreprep -low -fft -hamming -flucid  26    15   63.41
2nd    5    -nopreprep -low -fft -mink -flucid     31    10   75.61
2nd    6    -nopreprep -low -fft -cos -flucid      39    111  26.00

guess  run  class          good  bad  %
1st    1    CVE-2009-3829  6     0    100.00
1st    2    CVE-2009-4376  6     0    100.00
1st    3    CVE-2010-0304  6     0    100.00
1st    4    CVE-2010-2286  6     0    100.00
1st    5    CVE-2010-2283  6     0    100.00
1st    6    CVE-2009-3551  6     0    100.00
1st    7    CVE-2009-3549  6     0    100.00
1st    8    CVE-2009-3241  15    9    62.50
1st    9    CVE-2009-2560  9     6    60.00
1st    10   CVE-2010-1455  30    24   55.56
1st    11   CVE-2009-2563  6     5    54.55
1st    12   CVE-2009-2562  6     5    54.55
1st    13   CVE-2009-2561  6     7    46.15
1st    14   CVE-2009-4378  6     7    46.15
1st    15   CVE-2010-2287  6     7    46.15
1st    16   CVE-2009-3550  6     8    42.86
1st    17   CVE-2009-3243  13    23   36.11
1st    18   CVE-2009-4377  12    22   35.29
1st    19   CVE-2010-2285  6     11   35.29
1st    20   CVE-2009-2559  6     11   35.29
1st    21   CVE-2010-2284  6     12   33.33
1st    22   CVE-2009-3242  7     16   30.43
2nd    1    CVE-2009-3829  6     0    100.00
2nd    2    CVE-2009-4376  6     0    100.00
2nd    3    CVE-2010-0304  6     0    100.00
2nd    4    CVE-2010-2286  6     0    100.00
2nd    5    CVE-2010-2283  6     0    100.00
2nd    6    CVE-2009-3551  6     0    100.00
2nd    7    CVE-2009-3549  6     0    100.00
2nd    8    CVE-2009-3241  16    8    66.67
2nd    9    CVE-2009-2560  10    5    66.67
2nd    10   CVE-2010-1455  44    10   81.48
2nd    11   CVE-2009-2563  6     5    54.55
2nd    12   CVE-2009-2562  6     5    54.55
2nd    13   CVE-2009-2561  6     7    46.15
2nd    14   CVE-2009-4378  6     7    46.15
2nd    15   CVE-2010-2287  13    0    100.00
2nd    16   CVE-2009-3550  6     8    42.86
2nd    17   CVE-2009-3243  13    23   36.11
2nd    18   CVE-2009-4377  12    22   35.29
2nd    19   CVE-2010-2285  6     11   35.29
2nd    20   CVE-2009-2559  6     11   35.29
2nd    21   CVE-2010-2284  6     12   33.33
2nd    22   CVE-2009-3242  8     15   34.78
Table 3: CWE Stats for Wireshark 1.2.0, Separating DWT Wavelet Filter Preprocessing

guess  run  algorithms                                     good  bad  %
1st    1    -cweid -nopreprep -sdwt -fft -diff -flucid     31    5    86.11
1st    2    -cweid -nopreprep -sdwt -fft -eucl -flucid     29    7    80.56
1st    3    -cweid -nopreprep -sdwt -fft -mink -flucid     17    19   47.22
1st    4    -cweid -nopreprep -sdwt -fft -hamming -flucid  14    22   38.89
2nd    1    -cweid -nopreprep -sdwt -fft -diff -flucid     33    3    91.67
2nd    2    -cweid -nopreprep -sdwt -fft -eucl -flucid     34    2    94.44
2nd    3    -cweid -nopreprep -sdwt -fft -mink -flucid     27    9    75.00
2nd    4    -cweid -nopreprep -sdwt -fft -hamming -flucid  23    13   63.89

guess  run  class           good  bad  %
1st    1    CWE399          4     0    100.00
1st    2    CWE189          4     0    100.00
1st    3    NVD-CWE-Other   11    1    91.67
1st    4    CWE20           30    10   75.00
1st    5    NVD-CWE-noinfo  34    34   50.00
1st    6    CWE119          8     8    50.00
2nd    1    CWE399          4     0    100.00
2nd    2    CWE189          4     0    100.00
2nd    3    NVD-CWE-Other   11    1    91.67
2nd    4    CWE20           34    6    85.00
2nd    5    NVD-CWE-noinfo  53    15   77.94
2nd    6    CWE119          11    5    68.75

Table 4: CWE Stats for Wireshark 1.2.0, Low-Pass FFT Filter Preprocessing

guess  run  algorithms                                    good  bad  %
1st    1    -cweid -nopreprep -low -fft -diff -flucid     30    6    83.33
1st    2    -cweid -nopreprep -low -fft -cheb -flucid     30    6    83.33
1st    3    -cweid -nopreprep -low -fft -eucl -flucid     25    11   69.44
1st    4    -cweid -nopreprep -low -fft -mink -flucid     20    16   55.56
1st    5    -cweid -nopreprep -low -fft -cos -flucid      36    40   47.37
1st    6    -cweid -nopreprep -low -fft -hamming -flucid  12    24   33.33
2nd    1    -cweid -nopreprep -low -fft -diff -flucid     31    5    86.11
2nd    2    -cweid -nopreprep -low -fft -cheb -flucid     31    5    86.11
2nd    3    -cweid -nopreprep -low -fft -eucl -flucid     30    6    83.33
2nd    4    -cweid -nopreprep -low -fft -mink -flucid     22    14   61.11
2nd    5    -cweid -nopreprep -low -fft -cos -flucid      48    28   63.16
2nd    6    -cweid -nopreprep -low -fft -hamming -flucid  16    20   44.44

guess  run  class           good  bad  %
1st    1    CWE399          6     1    85.71
1st    2    CWE20           48    12   80.00
1st    3    NVD-CWE-Other   18    7    72.00
1st    4    CWE189          6     3    66.67
1st    5    NVD-CWE-noinfo  61    61   50.00
1st    6    CWE119          14    19   42.42
2nd    1    CWE399          6     1    85.71
2nd    2    CWE20           48    12   80.00
2nd    3    NVD-CWE-Other   18    7    72.00
2nd    4    CWE189          6     3    66.67
2nd    5    NVD-CWE-noinfo  78    44   63.93
2nd    6    CWE119          22    11   66.67
    observation location_id_550 ([line => ...
    observation location_id_550 ([line => ...
    observation location_id_550 ([line => ...
    observation location_id_550 ([line => ...
    observation location_id_550 ([line => 1138, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
    observation location_id_550 ([line => 1142, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
    observation location_id_550 ([line => 1146, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
    observation location_id_550 ([line => 1201, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
    observation location_id_550 ([line => 1205, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
    observation location_id_550 ([line => 1209, path => "wireshark-1.2.0/epan/dissectors/packet-dcerpc-nt.c")
    textoutput = "";

    observation grade = ([severity => 5, tool_specific_rank => 0.0], 1, 0, 1.0);

  end;

weakness_5 @ [id:5, tool_specific_id:5, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_5 = (locations_wk_5, 1, 0, 0.003778693428627627);
    locations_wk_5 = locations @ [tool_specific_id:5, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_652 ([line => 77, path => "wireshark-1.2.0/epan/dissectors/packet-dtls.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 264.64173897356557], 1, 0, 0.003778693428627627);
end;

weakness_6 @ [id:6, tool_specific_id:6, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_6 = (locations_wk_6, 1, 0, 0.004125022212036806);
    locations_wk_6 = locations @ [tool_specific_id:6, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_763 ([line => 8447, path => "wireshark-1.2.0/epan/dissectors/packet-gsm_a_rr.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 242.42293704067873], 1, 0, 0.004125022212036806);
end;

weakness_7 @ [id:7, tool_specific_id:7, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_7 = (locations_wk_7, 1, 0, 1.0);
    locations_wk_7 = locations @ [tool_specific_id:7, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_863 ([line => 945, path => "wireshark-1.2.0/epan/dissectors/packet-infiniband.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_8 @ [id:8, tool_specific_id:8, cweid:119, cwename:"Buffer Errors (CWE119)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_8 = (locations_wk_8, 1, 0, 1.0);
    locations_wk_8 = locations @ [tool_specific_id:8, cweid:119, cwename:"Buffer Errors (CWE119)"];
    observation location_id_877 ([line => 2746, path => "wireshark-1.2.0/epan/dissectors/packet-ipmi-se.c")
    observation location_id_877 ([line => 2748, path => "wireshark-1.2.0/epan/dissectors/packet-ipmi-se.c")
    observation location_id_877 ([line => 2752, path => "wireshark-1.2.0/epan/dissectors/packet-ipmi-se.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_9 @ [id:9, tool_specific_id:9, cweid:998, cwename:"Other (NVD-CWE-Other)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_9 = (locations_wk_9, 1, 0, 1.0);
    locations_wk_9 = locations @ [tool_specific_id:9, cweid:998, cwename:"Other (NVD-CWE-Other)"];
    observation location_id_882 ([line => 792, path => "wireshark-1.2.0/epan/dissectors/packet-ipmi.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_10 @ [id:10, tool_specific_id:10, cweid:119, cwename:"Buffer Errors (CWE119)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_10 = (locations_wk_10, 1, 0, 1.0);
    locations_wk_10 = locations @ [tool_specific_id:10, cweid:119, cwename:"Buffer Errors (CWE119)"];
    observation location_id_969 ([line => 523, path => "wireshark-1.2.0/epan/dissectors/packet-lwres.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_11 @ [id:11, tool_specific_id:11, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_11 = (locations_wk_11, 1, 0, 1.0);
    locations_wk_11 = locations @ [tool_specific_id:11, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_1099 ([line => 62, path => "wireshark-1.2.0/epan/dissectors/packet-paltalk.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_12 @ [id:12, tool_specific_id:12, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_12 = (locations_wk_12, 1, 0, 0.004878625561362927);
    locations_wk_12 = locations @ [tool_specific_id:12, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1174 ([line => 897, path => "wireshark-1.2.0/epan/dissectors/packet-radius.c")
    observation location_id_1174 ([line => 906, path => "wireshark-1.2.0/epan/dissectors/packet-radius.c")
    observation location_id_1174 ([line => 913, path => "wireshark-1.2.0/epan/dissectors/packet-radius.c")
    observation location_id_1174 ([line => 1005, path => "wireshark-1.2.0/epan/dissectors/packet-radius.c")
    observation location_id_1174 ([line => 1227, path => "wireshark-1.2.0/epan/dissectors/packet-radius.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 204.975763649431], 1, 0, 0.004878625561362927);
end;

weakness_13 @ [id:13, tool_specific_id:13, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_13 = (locations_wk_13, 1, 0, 1.0);
    locations_wk_13 = locations @ [tool_specific_id:13, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1282 ([line => 1131, path => "wireshark-1.2.0/epan/dissectors/packet-sflow.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_14 @ [id:14, tool_specific_id:14, cweid:998, cwename:"Other (NVD-CWE-Other)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_14 = (locations_wk_14, 1, 0, 1.0);
    locations_wk_14 = locations @ [tool_specific_id:14, cweid:998, cwename:"Other (NVD-CWE-Other)"];
    observation location_id_1303 ([line => 2141, path => "wireshark-1.2.0/epan/dissectors/packet-smb-pipe.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_15 @ [id:15, tool_specific_id:15, cweid:998, cwename:"Other (NVD-CWE-Other)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_15 = (locations_wk_15, 1, 0, 1.0);
    locations_wk_15 = locations @ [tool_specific_id:15, cweid:998, cwename:"Other (NVD-CWE-Other)"];
    observation location_id_1307 ([line => 8757, path => "wireshark-1.2.0/epan/dissectors/packet-smb.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_16 @ [id:16, tool_specific_id:16, cweid:189, cwename:"Numeric Errors (CWE189)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_16 = (locations_wk_16, 1, 0, 1.0);
    locations_wk_16 = locations @ [tool_specific_id:16, cweid:189, cwename:"Numeric Errors (CWE189)"];
    observation location_id_1307 ([line => 2195, path => "wireshark-1.2.0/epan/dissectors/packet-smb.c")
    textoutput = "";
    observation grade = ([severity => 2, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_17 @ [id:17, tool_specific_id:17, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_17 = (locations_wk_17, 1, 0, 0.008328136212759968);
    locations_wk_17 = locations @ [tool_specific_id:17, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1307 ([line => 8457, path => "wireshark-1.2.0/epan/dissectors/packet-smb.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 120.07488523877028], 1, 0, 0.008328136212759968);
end;

weakness_18 @ [id:18, tool_specific_id:18, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_18 = (locations_wk_18, 1, 0, 0.008328136212759964);
    locations_wk_18 = locations @ [tool_specific_id:18, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1309 ([line => 955, path => "wireshark-1.2.0/epan/dissectors/packet-smb2.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 120.07488523877032], 1, 0, 0.008328136212759964);
end;

weakness_19 @ [id:19, tool_specific_id:19, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_19 = (locations_wk_19, 1, 0, 0.004321352067642762);
    locations_wk_19 = locations @ [tool_specific_id:19, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1333 ([line => 813, path => "wireshark-1.2.0/epan/dissectors/packet-ssl-utils.c")
    observation location_id_1333 ([line => 843, path => "wireshark-1.2.0/epan/dissectors/packet-ssl-utils.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 231.40905539443497], 1, 0, 0.004321352067642762);
end;

weakness_20 @ [id:20, tool_specific_id:20, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_20 = (locations_wk_20, 1, 0, 0.0021114804997331383);
    locations_wk_20 = locations @ [tool_specific_id:20, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1334 ([line => 153, path => "wireshark-1.2.0/epan/dissectors/packet-ssl-utils.h")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 473.60134281438354], 1, 0, 0.0021114804997331383);
end;

weakness_21 @ [id:21, tool_specific_id:21, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_21 = (locations_wk_21, 1, 0, 0.003463630817441021);
    locations_wk_21 = locations @ [tool_specific_id:21, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1335 ([line => 275, path => "wireshark-1.2.0/epan/dissectors/packet-ssl.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 288.71437306901373], 1, 0, 0.003463630817441021);
end;

weakness_22 @ [id:22, tool_specific_id:22, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_22 = (locations_wk_22, 1, 0, 0.004125022212036806);
    locations_wk_22 = locations @ [tool_specific_id:22, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_1583 ([line => 1799, path => "wireshark-1.2.0/epan/packet.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 242.42293704067873], 1, 0, 0.004125022212036806);
end;

weakness_23 @ [id:23, tool_specific_id:23, cweid:399, cwename:"Resource Management Errors (CWE399)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_23 = (locations_wk_23, 1, 0, 1.0);
    locations_wk_23 = locations @ [tool_specific_id:23, cweid:399, cwename:"Resource Management Errors (CWE399)"];
    observation location_id_1611 ([line => 345, path => "wireshark-1.2.0/epan/sigcomp-udvm.c")
    textoutput = "";
    observation grade = ([severity => 3, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_24 @ [id:24, tool_specific_id:24, cweid:119, cwename:"Buffer Errors (CWE119)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_24 = (locations_wk_24, 1, 0, 1.0);
    locations_wk_24 = locations @ [tool_specific_id:24, cweid:119, cwename:"Buffer Errors (CWE119)"];
    observation location_id_1611 ([line => 321, path => "wireshark-1.2.0/epan/sigcomp-udvm.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_25 @ [id:25, tool_specific_id:25, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_25 = (locations_wk_25, 1, 0, 0.001495003320843141);
    locations_wk_25 = locations @ [tool_specific_id:25, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2012 ([line => 89, path => "wireshark-1.2.0/plugins/docsis/packet-bpkmreq.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 668.8948352542972], 1, 0, 0.001495003320843141);
end;

weakness_26 @ [id:26, tool_specific_id:26, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_26 = (locations_wk_26, 1, 0, 0.0014959114047375394);
    locations_wk_26 = locations @ [tool_specific_id:26, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2013 ([line => 90, path => "wireshark-1.2.0/plugins/docsis/packet-bpkmrsp.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 668.4887867242726], 1, 0, 0.0014959114047375394);
end;

weakness_27 @ [id:27, tool_specific_id:27, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_27 = (locations_wk_27, 1, 0, 0.002153585613826869);
    locations_wk_27 = locations @ [tool_specific_id:27, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2020 ([line => 72, path => "wireshark-1.2.0/plugins/docsis/packet-dsaack.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 464.341883405798], 1, 0, 0.002153585613826869);
end;

weakness_28 @ [id:28, tool_specific_id:28, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_28 = (locations_wk_28, 1, 0, 0.00229165238295895);
    locations_wk_28 = locations @ [tool_specific_id:28, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2022 ([line => 72, path => "wireshark-1.2.0/plugins/docsis/packet-dsarsp.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 436.36635618741343], 1, 0, 0.00229165238295895);
end;

weakness_29 @ [id:29, tool_specific_id:29, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_29 = (locations_wk_29, 1, 0, 0.002184230355798278);
    locations_wk_29 = locations @ [tool_specific_id:29, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2023 ([line => 72, path => "wireshark-1.2.0/plugins/docsis/packet-dscack.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 457.82716889058463], 1, 0, 0.002184230355798278);
end;

weakness_30 @ [id:30, tool_specific_id:30, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_30 = (locations_wk_30, 1, 0, 0.0023006295251237975);
    locations_wk_30 = locations @ [tool_specific_id:30, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2025 ([line => 73, path => "wireshark-1.2.0/plugins/docsis/packet-dscrsp.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 434.6636383996635], 1, 0, 0.0023006295251237975);
end;

weakness_31 @ [id:31, tool_specific_id:31, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_31 = (locations_wk_31, 1, 0, 0.001897888480826156);
    locations_wk_31 = locations @ [tool_specific_id:31, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2032 ([line => 72, path => "wireshark-1.2.0/plugins/docsis/packet-regack.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 526.9013485790784], 1, 0, 0.001897888480826156);
end;

weakness_32 @ [id:32, tool_specific_id:32, cweid:20, cwename:"Input Validation (CWE20)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_32 = (locations_wk_32, 1, 0, 0.002216818195096963);
    locations_wk_32 = locations @ [tool_specific_id:32, cweid:20, cwename:"Input Validation (CWE20)"];
    observation location_id_2035 ([line => 73, path => "wireshark-1.2.0/plugins/docsis/packet-regrsp.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 451.096983149879], 1, 0, 0.002216818195096963);
end;

weakness_33 @ [id:33, tool_specific_id:33, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_33 = (locations_wk_33, 1, 0, 0.0028814675905206645);
    locations_wk_33 = locations @ [tool_specific_id:33, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_2097 ([line => 433, path => "wireshark-1.2.0/plugins/opcua/opcua_complextypeparser.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 347.04537482557834], 1, 0, 0.0028814675905206645);
end;

weakness_34 @ [id:34, tool_specific_id:34, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_34 = (locations_wk_34, 1, 0, 0.0028288900371324934);
    locations_wk_34 = locations @ [tool_specific_id:34, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_2107 ([line => 616, path => "wireshark-1.2.0/plugins/opcua/opcua_serviceparser.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 353.49553601371184], 1, 0, 0.0028288900371324934);
end;

weakness_35 @ [id:35, tool_specific_id:35, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_35 = (locations_wk_35, 1, 0, 0.003058220966230374);
    locations_wk_35 = locations @ [tool_specific_id:35, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_2110 ([line => 340, path => "wireshark-1.2.0/plugins/opcua/opcua_simpletypes.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 326.9874907805045], 1, 0, 0.003058220966230374);
end;

weakness_36 @ [id:36, tool_specific_id:36, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_36 = (locations_wk_36, 1, 0, 0.0018096494904338023);
    locations_wk_36 = locations @ [tool_specific_id:36, cweid:999, cwename:"Insufficient Information (NVD-CWE-noinfo)"];
    observation location_id_2112 ([line => 132, path => "wireshark-1.2.0/plugins/opcua/opcua_transport_layer.c")
    observation location_id_2112 ([line => 169, path => "wireshark-1.2.0/plugins/opcua/opcua_transport_layer.c")
    observation location_id_2112 ([line => 181, path => "wireshark-1.2.0/plugins/opcua/opcua_transport_layer.c")
    observation location_id_2112 ([line => 195, path => "wireshark-1.2.0/plugins/opcua/opcua_transport_layer.c")
    observation location_id_2112 ([line => 226, path => "wireshark-1.2.0/plugins/opcua/opcua_transport_layer.c")
    observation location_id_2112 ([line => 250, path => "wireshark-1.2.0/plugins/opcua/opcua_transport_layer.c")
    textoutput = "";
    observation grade = ([severity => 5, tool_specific_rank => 552.593198454295], 1, 0, 0.0018096494904338023);
end;

weakness_37 @ [id:37, tool_specific_id:37, cweid:119, cwename:"Buffer Errors (CWE119)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_37 = (locations_wk_37, 1, 0, 1.0);
    locations_wk_37 = locations @ [tool_specific_id:37, cweid:119, cwename:"Buffer Errors (CWE119)"];
    observation location_id_2321 ([line => 149, path => "wireshark-1.2.0/wiretap/daintree-sna.c")
    observation location_id_2321 ([line => 205, path => "wireshark-1.2.0/wiretap/daintree-sna.c")
    textoutput = "";
    observation grade = ([severity => 1, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

weakness_38 @ [id:38, tool_specific_id:38, cweid:189, cwename:"Numeric Errors (CWE189)"]
where
    dimension id, tool_specific_id, cweid, cwename;
    observation sequence weakness_38 = (locations_wk_38, 1, 0, 1.0);
    locations_wk_38 = locations @ [tool_specific_id:38, cweid:189, cwename:"Numeric Errors (CWE189)"];
    observation location_id_2327 ([line => 228, path => "wireshark-1.2.0/wiretap/erf.c")
    textoutput = "";
    observation grade = ([severity => 2, tool_specific_rank => 0.0], 1, 0, 1.0);
end;

} 
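
In the listing above, the last component of each observation tuple is numerically consistent with the reciprocal of tool_specific_rank, with a rank of 0.0 paired with 1.0. This relationship is inferred from the printed values only; the listing itself does not state it. A minimal Python sketch under that assumption follows, where confidence_from_rank is a hypothetical helper name and not a MARFCAT function.

```python
import math


def confidence_from_rank(rank: float) -> float:
    """Map a distance-like rank to a weight in (0, 1].

    Assumption inferred from the listing: a rank of 0.0 maps to 1.0,
    and any positive rank maps to its reciprocal.
    """
    if rank < 0.0:
        raise ValueError("rank must be non-negative")
    return 1.0 if rank == 0.0 else 1.0 / rank


# (tool_specific_rank, printed weight) pairs copied from the listing.
pairs = [
    (264.64173897356557, 0.003778693428627627),  # weakness_5
    (120.07488523877028, 0.008328136212759968),  # weakness_17
    (0.0, 1.0),                                  # weakness_7
]

for rank, printed in pairs:
    computed = confidence_from_rank(rank)
    print(rank, computed, math.isclose(computed, printed, rel_tol=1e-9))
```

Read this way, a smaller rank (a closer match to a trained weakness fingerprint) yields a weight nearer to 1.0.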